GPU Computing in Julia

Biostat/Biomath M257

Author

Dr. Hua Zhou @ UCLA

Published

April 18, 2025

This lecture introduces GPU computing in Julia.

1 GPGPU

GPUs are ubiquitous in modern computers. Following are NVIDIA GPUs on today’s typical computer systems.

NVIDIA GPUs H100 PCIe RTX 6000 RTX 5000
H100 RTX 6000 RTX 5000
Computers servers, cluster desktop laptop
Server Desktop Laptop
Main usage scientific computing daily work, gaming daily work
Memory 80 GB 48 GB 16 GB
Memory bandwidth 2 TB/sec 960 GB/sec 576 GB/sec
Number of cores ??? ??? ???
Processor clock ??? GHz ??? GHz ??? GHz
Peak DP performance 26 TFLOPS ??? TFLOPS ??? TFLOPS
Peak SP performance 51 TFLOPS 91.1 TFLOPS 42.6 TFLOPS

2 GPU architecture vs CPU architecture

  • GPUs contain 1000s of processing cores on a single card; several cards can fit in a desktop PC

  • Each core carries out the same operations in parallel on different input data – single program, multiple data (SPMD) paradigm

  • Extremely high arithmetic intensity if one can transfer the data onto and results off of the processors quickly

i7 die Fermi die
Einstein Rain man

3 GPGPU in Julia

GPU support by Julia is under active development. Check JuliaGPU for currently available packages.

There are multiple paradigms to program GPU in Julia, depending on the specific hardware.

  • CUDA is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: CuBLAS, CuRAND, CuSparse, CuSolve, CuDNN, …

    The CUDA.jl package allows defining arrays on Nvidia GPUs and overloads many common operations.

  • The AMDGPU.jl package allows defining arrays on AMD GPUs and overloads many common operations.

  • The Metal.jl package allows defining arrays on Apple Silicon GPU and overloads many common operations.

    AppleAccelerate.jl wraps the macOS Accelerate framework, which provides high-performance libraries for linear algebra, signal processing, and image processing on Apple Silicon CPU. This is analog of MKL for Intel CPU.

  • The oneAPI.jl package allows defining arrays on Intel GPUs and overloads many common operations.

I’ll illustrate using Metal.jl on my MacBook Pro running MacOS Sequoia 15.4. It has Apple M2 chip with 38 GPU cores.

versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code

Load packages:

using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-257/2025spring/slides/09-juliagpu`
Status `~/Documents/github.com/ucla-biostat-257/2025spring/slides/09-juliagpu/Project.toml`
  [13e28ba4] AppleAccelerate v0.4.0
  [6e4b80f9] BenchmarkTools v1.6.0
  [bdcacae8] LoopVectorization v0.12.172
  [dde4c033] Metal v1.5.1
  [37e2e46d] LinearAlgebra v1.11.0

4 Query GPU devices in the system

using Metal, AppleAccelerate

Metal.versioninfo()
macOS 15.4.0, Darwin 24.4.0

Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6

Julia packages: 
- Metal.jl: 1.5.1
- GPUArrays: 11.2.2
- GPUCompiler: 1.3.2
- KernelAbstractions: 0.9.34
- ObjectiveC: 3.4.1
- LLVM: 9.2.0
- LLVMDowngrader_jll: 0.6.0+0

1 device:
- Apple M2 Max (3.000 GiB allocated)

5 Transfer data between main memory and GPU

using Random
Random.seed!(257)

# generate SP data on CPU
x = rand(Float32, 3, 3)
# transfer data form CPU to GPU
xd = MtlArray(x)
3×3 MtlMatrix{Float32, Metal.PrivateStorage}:
 0.145793  0.939801  0.479926
 0.567772  0.577251  0.81655
 0.800538  0.38893   0.914135
# generate array on GPU directly
# yd = Metal.ones(3, 3)
yd = MtlArray(ones(Float32, 3, 3))
3×3 MtlMatrix{Float32, Metal.PrivateStorage}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
# collect data from GPU to CPU
x = collect(xd)
3×3 Matrix{Float32}:
 0.145793  0.939801  0.479926
 0.567772  0.577251  0.81655
 0.800538  0.38893   0.914135

6 Linear algebra

using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)

n = 2^14
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z);

6.1 Dot product

# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)
BenchmarkTools.Trial: 138 samples with 1 evaluation per sample.
 Range (minmax):  34.377 ms56.968 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     34.923 ms               GC (median):    0.00%
 Time  (mean ± σ):   36.388 ms ±  3.159 ms   GC (mean ± σ):  0.00% ± 0.00%
  ▆██▃ ▁      ▁                                               
  ██████▅▇▁▅▁█▇▇▁▅▇▅▇▇▇▁▅█▁▁▁▇▁▅▅▁▁▅▁▁▁▅▁▅▁▅▁▅▁▁▁▅▁▁▅▁▇▁▅▁▅ ▅
  34.4 ms      Histogram: log(frequency) by time      45.5 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# SP matrix dot product on CPU using Apple Accelerate: tr(X'Y)
bm_acc = @benchmark AppleAccelerate.dot($x, $y)
BenchmarkTools.Trial: 143 samples with 1 evaluation per sample.
 Range (minmax):  34.379 ms 40.565 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     35.034 ms                GC (median):    0.00%
 Time  (mean ± σ):   35.193 ms ± 982.134 μs   GC (mean ± σ):  0.00% ± 0.00%
    █▂    ▂                                                   
  ▃████▆██▅▅▄▃▁▃▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▃▁▁▁▃▁▁▃ ▃
  34.4 ms         Histogram: frequency by time         39.9 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark Metal.@sync dot($xd, $yd)
BenchmarkTools.Trial: 662 samples with 1 evaluation per sample.
 Range (minmax):  7.447 ms 8.989 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.539 ms               GC (median):    0.00%
 Time  (mean ± σ):   7.556 ms ± 93.986 μs   GC (mean ± σ):  0.00% ± 0.00%
        ▁ ▃█▃▅▁▁▁▃                                         
  ▃▄▄▄▇▇█▆██████████▆▇█▆▇▇▅█▇▇▆▄▄▄▄▂▄▂▃▂▄▂▃▃▂▂▃▃▂▁▃▁▂▂▁▃▂▂ ▄
  7.45 ms        Histogram: frequency by time        7.78 ms <
 Memory estimate: 21.24 KiB, allocs estimate: 837.
# speedup by Apple Accelerate
median(bm_acc.times) / median(bm_cpu.times)
1.0031748038858854
# speedup on GPU over CPU
median(bm_cpu.times) / median(bm_gpu.times)
4.632175281234197

6.2 Broadcast

# SP broadcast on CPU: z .= x .* y
bm_cpu = @benchmark $z .= $x .* $y
BenchmarkTools.Trial: 142 samples with 1 evaluation per sample.
 Range (minmax):  34.871 ms 36.395 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     35.260 ms                GC (median):    0.00%
 Time  (mean ± σ):   35.330 ms ± 307.233 μs   GC (mean ± σ):  0.00% ± 0.00%
   ▄ ▄     ▁██▁▆▄▃▄▃   ▃    ▁     ▄                           
  ▆█▆█▇▁▆▇▇█████████▇▄▆█▇▇▇▁█▄▆▁▄▄█▄▇▆▄▄▄▄▇▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▄▄ ▄
  34.9 ms         Histogram: frequency by time         36.3 ms <
 Memory estimate: 0 bytes, allocs estimate: 0.
# SP broadcast on GPU: z .= x .* y
# why is there allocation?
bm_gpu = @benchmark Metal.@sync $zd .= $xd .* $yd
BenchmarkTools.Trial: 564 samples with 1 evaluation per sample.
 Range (minmax):  8.755 ms 9.588 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.855 ms               GC (median):    0.00%
 Time  (mean ± σ):   8.867 ms ± 76.948 μs   GC (mean ± σ):  0.00% ± 0.00%
      ▁ ▁▂ ▃▆▆▂▁█▄ ▂▂▄▇ ▂                                   
  ▄▃▄██▇██████████████▇█▇▆▄▅▇▄▆▄▃▃▃▃▃▂▃▃▂▁▂▁▂▁▂▁▂▂▁▁▁▁▂▁▂ ▄
  8.76 ms        Histogram: frequency by time        9.12 ms <
 Memory estimate: 4.53 KiB, allocs estimate: 177.
# speedup
median(bm_cpu.times) / median(bm_gpu.times)
3.9818234247150652

6.3 Matrix multiplication

# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
BenchmarkTools.Trial: 6 samples with 1 evaluation per sample.
 Range (minmax):  913.740 ms  1.000 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     982.052 ms               GC (median):    0.00%
 Time  (mean ± σ):   963.723 ms ± 37.417 ms   GC (mean ± σ):  0.00% ± 0.00%
  ▁  ▁                                            ▁         ▁  
  █▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█ ▁
  914 ms          Histogram: frequency by time             1 s <
 Memory estimate: 960 bytes, allocs estimate: 54.

For this problem size on this machine, we see GPU achieves a staggering 9 TFLOPS throughput with single precision!

# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
9.626476494557188e12
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
BenchmarkTools.Trial: 2 samples with 1 evaluation per sample.
 Range (minmax):  3.901 s 3.905 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     3.903 s              GC (median):    0.00%
 Time  (mean ± σ):   3.903 s ± 2.599 ms   GC (mean ± σ):  0.00% ± 0.00%
                              ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.9 s         Histogram: frequency by time         3.9 s <
 Memory estimate: 0 bytes, allocs estimate: 0.
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
2.2548119538396475e12

We see >10x speedup by GPUs in this matrix multiplication example.

# cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
xtxd = xd'xd + I
@benchmark Metal.@sync cholesky($(xtxd))
ArgumentError: ArgumentError: cannot take the CPU address of a MtlMatrix{Float32, Metal.PrivateStorage}
ArgumentError: cannot take the CPU address of a MtlMatrix{Float32, Metal.PrivateStorage}



Stacktrace:

  [1] unsafe_convert(::Type{Ptr{Float32}}, x::MtlMatrix{Float32, Metal.PrivateStorage})

    @ Metal ~/.julia/packages/Metal/N2ABH/src/array.jl:266

  [2] potrf!(uplo::Char, A::MtlMatrix{Float32, Metal.PrivateStorage})

    @ LinearAlgebra.LAPACK ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/lapack.jl:3293

  [3] _chol!

    @ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/cholesky.jl:187 [inlined]

  [4] #cholesky!#163

    @ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/cholesky.jl:268 [inlined]

  [5] cholesky!

    @ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/cholesky.jl:267 [inlined]

  [6] cholesky!(A::MtlMatrix{Float32, Metal.PrivateStorage}, ::NoPivot; check::Bool)

    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/cholesky.jl:301

  [7] cholesky! (repeats 2 times)

    @ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/cholesky.jl:295 [inlined]

  [8] _cholesky

    @ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/cholesky.jl:411 [inlined]

  [9] cholesky (repeats 2 times)

    @ ~/.julia/juliaup/julia-1.11.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/cholesky.jl:401 [inlined]

 [10] macro expansion

    @ ~/.julia/packages/Metal/N2ABH/src/utilities.jl:10 [inlined]

 [11] var"##core#340"(xtxd#339::MtlMatrix{Float32, Metal.PrivateStorage})

    @ Main ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:598

 [12] var"##sample#341"(::Tuple{MtlMatrix{Float32, Metal.PrivateStorage}}, __params::BenchmarkTools.Parameters)

    @ Main ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:607

 [13] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; maxevals::Int64, kwargs::@Kwargs{})

    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:186

 [14] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)

    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:181

 [15] #invokelatest#2

    @ ./essentials.jl:1055 [inlined]

 [16] invokelatest

    @ ./essentials.jl:1052 [inlined]

 [17] #lineartrial#46

    @ ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:51 [inlined]

 [18] lineartrial

    @ ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:50 [inlined]

 [19] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::@Kwargs{})

    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:299

 [20] tune!

    @ ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:288 [inlined]

 [21] tune!(b::BenchmarkTools.Benchmark)

    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:288

 [22] top-level scope

    @ ~/.julia/packages/BenchmarkTools/1i1mY/src/execution.jl:461
xtx = collect(xtxd)
@benchmark LinearAlgebra.cholesky($(Symmetric(xtx)))
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 7.649 s (0.00% GC) to evaluate,
 with a memory estimate of 1.00 GiB, over 3 allocations.
@benchmark AppleAccelerate.cholesky($(Symmetric(xtx)))
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 7.635 s (0.00% GC) to evaluate,
 with a memory estimate of 1.00 GiB, over 3 allocations.

We don’t see GPU speedup of Cholesky at the moment.

7 Evaluation of elementary and special functions on GPU

7.1 Sine and log functions

# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark Metal.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu
BenchmarkTools.Trial: 563 samples with 1 evaluation per sample.
 Range (minmax):  8.769 ms 9.569 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.871 ms               GC (median):    0.00%
 Time  (mean ± σ):   8.879 ms ± 67.372 μs   GC (mean ± σ):  0.00% ± 0.00%
        ▁▁▁ ▂▂▃█▄▆▁▄▁▃▂▄▃                                  
  ▂▃▅▆▆▆███▇██████████████▇▅▆▇▇▄▄▄▅▂▄▄▄▃▁▃▁▂▃▁▁▁▂▂▂▂▁▁▁▁▂ ▄
  8.77 ms        Histogram: frequency by time         9.1 ms <
 Memory estimate: 4.53 KiB, allocs estimate: 177.
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu
BenchmarkTools.Trial: 2 samples with 1 evaluation per sample.
 Range (minmax):  2.750 s 2.755 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.753 s              GC (median):    0.00%
 Time  (mean ± σ):   2.753 s ± 3.053 ms   GC (mean ± σ):  0.00% ± 0.00%
                              ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.75 s        Histogram: frequency by time        2.75 s <
 Memory estimate: 0 bytes, allocs estimate: 0.
# Speed up
median(bm_cpu.times) / median(bm_gpu.times)
310.27966704755283

GPU brings great speedup (>50x) to the massive evaluation of elementary math functions.

7.2 tanh function

bm_cpu = @benchmark z .= tanh.($x) # on CPU
bm_cpu
BenchmarkTools.Trial: 4 samples with 1 evaluation per sample.
 Range (minmax):  1.600 s 1.608 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.605 s              GC (median):    0.00%
 Time  (mean ± σ):   1.604 s ± 3.526 ms   GC (mean ± σ):  0.00% ± 0.00%
  █                               █          █           █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.6 s         Histogram: frequency by time        1.61 s <
 Memory estimate: 16 bytes, allocs estimate: 1.
bm_acc = @benchmark z .= AppleAccelerate.tanh.($x) # AppleAccelerate
bm_acc
BenchmarkTools.Trial: 27 samples with 1 evaluation per sample.
 Range (minmax):  190.344 ms192.504 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     191.780 ms                GC (median):    0.00%
 Time  (mean ± σ):   191.633 ms ± 508.052 μs   GC (mean ± σ):  0.00% ± 0.00%
                                 █      █       ▃            
  ▇▁▁▁▁▁▁▇▁▁▁▇▇▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁█▁▁▇▁█▁▇▇█▁▇▇▇▇▁▇█▁▁▁▁▁▁▁▁▁▇ ▁
  190 ms           Histogram: frequency by time          193 ms <
 Memory estimate: 16 bytes, allocs estimate: 1.

AppleAccelerate.jl accelerates the evaluation of tanh function by

median(bm_cpu.times) / median(bm_acc.times)
8.370156534894807
bm_mtl = @benchmark zd .= Metal.@sync tanh.($xd) # Metal
bm_mtl
BenchmarkTools.Trial: 74 samples with 1 evaluation per sample.
 Range (minmax):  42.764 ms   3.338 s   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.531 ms                GC (median):    0.00%
 Time  (mean ± σ):   99.056 ms ± 384.921 ms   GC (mean ± σ):  0.00% ± 0.00%
  █       ▆▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  42.8 ms       Histogram: log(frequency) by time       477 ms <
 Memory estimate: 8.93 KiB, allocs estimate: 335.

Metal.jl accelerates the evaluation of tanh function by

median(bm_cpu.times) / median(bm_mtl.times)
33.77199471480576